Speaker-Independent Phone Recognition Using BREF
Abstract
A series of experiments on speaker-independent phone recognition of continuous speech has been carried out using the recently recorded BREF corpus. These experiments are the first to use this large corpus, and are meant to provide a baseline performance evaluation for vocabulary-independent phone recognition of French. The HMM-based recognizer was trained with hand-verified data from 43 speakers. Using 35 context-independent phone models, a baseline phone accuracy of 60% (no phone grammar) was obtained on an independent test set of 7635 phone segments from 19 new speakers. Including phone bigram probabilities as phonotactic constraints raised the phone accuracy to 63.5%. A phone accuracy of 68.6% was obtained with 428 context-dependent models and the bigram phone language model. Vocabulary-independent word recognition results with no grammar are also reported for the same test data.
INTRODUCTION
This paper reports on a series of experiments on speaker-independent, continuous speech phone recognition of French, using the recently recorded BREF corpus [4, 6]. BREF was designed to provide speech data for the development of dictation machines and for the evaluation of continuous speech recognition systems (both speaker-dependent and speaker-independent), and to provide a large corpus of continuous speech for the study of phonological variations. These experiments are the first to use this corpus, and are meant to provide a baseline performance evaluation for vocabulary-independent (VI) phone recognition, as well as to develop a procedure for automatic segmentation and labeling of the corpus.
First, a brief description of BREF is given, along with the procedure for semi-automatic (verified) labeling and automatic segmentation of the speech data. The ability to accurately predict the phone labels from the text is assessed, as is the accuracy of the automatic segmentation.
Next, the phone recognition experiments performed using speech data from 62 speakers (43 for training, 19 for test) are described. A hidden Markov model (HMM) based recognizer has been evaluated with context-independent (CI) and context-dependent (CD) model sets, both with and without a duration model. Results are also given with and without the use of 1-gram and 2-gram statistics to provide phonotactic constraints. Preliminary VI word recognition results are presented with no grammar. The final section provides a discussion and summary, and a comparison of these results to the performance of other phone recognizers.
THE BREF CORPUS
BREF is a large read-speech corpus, containing over 100 hours of speech material from 120 speakers. The text materials were selected verbatim from the French newspaper Le Monde, so as to provide a large vocabulary (over 20,000 words) and a wide range of phonetic environments [4]. Containing 1115 distinct diphones and over 17,500 triphones, BREF can be used to train VI phonetic models. Hon and Lee [5] concluded that for VI recognition, the coverage of triphones is crucial. Separate text materials with similar distributional properties were selected for training, development test, and evaluation purposes. The selected texts consist of 18 "all phoneme" sentences, approximately 840 paragraphs, 3300 short sentences (12.4 words/sentence), and 3800 longer sentences (21 words/sentence). The distributional properties for the 3 sets of texts, and the combined total, are shown in Table 1. The sets are distributionally comparable in terms of their coverage of word and subword units and quite similar in their phone and diphone distributions. For comparison, the last column of the table gives the distributional properties for the original text of Le Monde.
Each of 80 speakers read approximately 10,000 words (about 650 sentences) of text, and an additional 40 speakers each read about half that amount.
The speakers, chosen from a subject pool of over 250 persons in the Paris area, were paid for their participation. Potential subjects were given a short reading test containing selected sentences from Le Monde representative of the type of material to be recorded [6], and subjects judged to be incapable of the task were not recorded. The recordings were made in stereo in a sound-isolated room, and were monitored to verify the contents. Thus far, 80 training, 20 test, and 20 evaluation speakers have been recorded. The number of male and female speakers for each subcorpus is given in Table 2. The ages of the speakers range from 18 to 73 years, with 75% between the ages of 20 and 40 years. In these experiments only a subset of the training and development test data was used, reserving the evaluation data for future use.
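The phone accuracy figures quoted in the abstract are not defined in this excerpt. As a minimal sketch, assuming the commonly used definition in which substitutions (S), deletions (D), and insertions (I) from a minimum-edit alignment are all counted against the N reference phones, accuracy = 100 × (N − S − D − I) / N, the computation could look as follows. The function names and the toy phone strings are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for one
    minimum-edit-distance alignment of hyp against ref."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, n)          # match, no error
            else:
                diag = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = cost[i - 1][j]
            dele = (e + 1, s, d + 1, n)      # deletion
            e, s, d, n = cost[i][j - 1]
            ins = (e + 1, s, d, n + 1)       # insertion
            # Tuples compare on total errors first, so this keeps a
            # minimum-error alignment.
            cost[i][j] = min(diag, dele, ins)
    _, s, d, n = cost[-1][-1]
    return s, d, n


def phone_accuracy(ref: List[str], hyp: List[str]) -> float:
    """Assumed definition: 100 * (N - S - D - I) / N, N = len(ref)."""
    if not ref:
        raise ValueError("reference phone string must not be empty")
    s, d, n = align_counts(ref, hyp)
    return 100.0 * (len(ref) - s - d - n) / len(ref)


if __name__ == "__main__":
    reference = ["a", "b", "c", "d"]        # toy reference labels
    recognized = ["a", "x", "c", "d", "e"]  # one substitution, one insertion
    print(phone_accuracy(reference, recognized))
```

Because insertions are penalized, this measure can be lower than the percentage of correctly recognized phones, and it can even become negative for a recognizer that inserts many spurious phones.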
Similar resources
Experiments on Speaker-Independent Phone Recognition Using BREF
A series of experiments for speaker-independent, continuous speech phone recognition have been carried out using the recently recorded BREF corpus. Our experiments are the first to use this database, and are meant to provide a baseline performance evaluation for vocabulary independent phone recognition. The system was trained using hand-verified data from 43 speakers. Using 35 context-independent ...
Cross-Lingual Experiments with Phone Recognition
This paper presents some of the recent research on speaker-independent continuous phone recognition for both French and English. The phone accuracy is assessed on the BREF corpus for French, and on the Wall Street Journal and TIMIT corpora for English. Cross-language differences concerning language properties are presented. It was found that French is easier to recognize at the phone level (the...
High performance speaker-independent phone recognition using CDHMM
In this paper we report high phone accuracies on three corpora: WSJ0, BREF and TIMIT. The main characteristics of the phone recognizer are: high dimensional feature vector (48), context- and gender-dependent phone models with duration distribution, continuous density HMM with Gaussian mixtures, and n-gram probabilities for the phonotactic constraints. These models are trained on speech data that hav...
Continuous Speech Recognition at LIMSI
This paper presents some of the recent research on speaker-independent continuous speech recognition at LIMSI including efforts in phone and word recognition for both French and English. Evaluation of an HMM-based phone recognizer on a subset of the BREF corpus gives a phone accuracy of 67.1% with 35 context-independent phone models and 74.2% with 428 context-dependent phone models. The word ac...
A phone-based approach to non-linguistic speech feature identification
In this paper we present a general approach to identifying non-linguistic speech features from the recorded signal using phone-based acoustic likelihoods. The basic idea is to process the unknown speech signal by feature-specific phone model sets in parallel, and to hypothesize the feature value associated with the model set having the highest likelihood. This technique is shown to be effective...
Publication date: 1992